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ABSTRACT 

Summary: Genetic data obtained on population samples convey 
information about their evolutionary history. Inference methods can 
extract part of this information but they require sophisticated sta- 
tistical techniques that have been made available to the biologist 
community (through computer programs) only for simple and stan- 
dard situations typically involving a small number of samples. We 
propose here a computer program {DIYABC) for inference based 
on Approximate Bayesian Computation (ABC), in which scenarios can 
be customized by the user to fit many complex situations involving 
any number of populations and samples. Such scenarios involve any 
combination of population divergences, admixtures and population 
size changes. DIYABC can be used to compare competing scena- 
rios, estimate parameters for one or more scenarios, and compute 
bias and precision measures for a given scenario and known values 
of parameters (the current version applies to unlinked microsatellite 
data). This article describes key methods used in the program and 
provides its main features. The analysis of one simulated and one 
real data set, both with complex evolutionary scenarios, illustrates 
the main possibilities of DIYABC. 

Availability: The software DIYABC is freely available at 

http://www.montpellier.inra.fr/CBGP/diyabc. 

Supplementary material: Supplementary data are also available at 

http://www.montpellier.inra.fr/CBGP/diyabc 

Contact: j.cornuet@imperial.ac.uk 
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1 INTRODUCTION 

Until now, most literature and software about inference in popu- 
lation genetics concern simple standard evolutionary scenarios: a 
single population (Griffiths and Tavare, 1994; Stephens and Don- 
nelly, 2000; Beaumont, 1999), two populations exchanging genes 
(Hey and Nielsen, 2004; De Iorio and Griffiths, 2004) or not 
(Hickerson et al. , 2007) or three populations in the classic admixture 
scheme (Wang, 2003; Excoffier et al, 2005). The main exception to 
our knowledge is the computer program BATWING (Wilson et 
al, 2003) which considers a whole family of scenarios in which 
an ancestral population splits into as many subpopulations as nee- 
ded. However, in practice, population geneticists collect and analyse 
samples which rarely correspond to one of these standard scenarios. 
If they want to apply methods developed in the literature and for 
which computer programs are available, they have to select subsets 
of samples (to fit these standard situations), at the price of lowering 
the power of the analysis. The other solution is to develop their own 
software, which requires specific skills or the right collaborators. 
Rare examples of inference in non standard scenarios can be found 
in O'Ryan et al. (1998) including 3 populations and two successive 
divergences, or Estoup et al. (2004) (10 populations that sequenti- 
ally diverged with initial bottlenecks and exchanging migrants with 
neighbouring populations). 

Inference in complex evolutionary scenarios can be performed in 
various ways, but all are based on the genealogical tree of sampled 
genes and coalescence theory. A first approach used in programs 
such as IM (Hey and Nielsen, 2004) or BATWING consists of 
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starting from a gene genealogy compatible with the observed data 
and exploring the parameter and genealogy space through MCMC 
algorithms. One difficulty with this approach is to be sure that 
the MCMC has converged, because of the huge dimension of the 
parameter space. With a complex scenario, the difficulty is incre- 
ased. Also, although not impossible, it seems quite challenging to 
write a program that would deal with very different scenarios. A 
second approach pioneered by Beaumont (2003) consists in com- 
bining MCMC exploration of the scenario parameter space with an 
Importance Sampling (IS) based estimation of the likelihood. The 
strength of this approach is that the low number of parameters ensu- 
res a (relatively) fast convergence of the MCMC. Its weakness is that 
the likelihood is only approximated through IS, sometimes resulting 
in poor acceptance rates. 

When dealing with complex situations, the two previous approa- 
ches raise difficulties which mainly stem in the computation of the 
likelihood. Consequently, a line of research including the works of 
Tavare et al. (1997), Weiss and von Haeseler (1998), Pritchard et al. 
(1999) and Marjoram et al. (2003) developed a new approach ter- 
med Approximate Bayesian Computation (or ABC) by Beaumont 
et al. (2002). In this approach, the likelihood criterion is replaced 
by a similarity criterion between simulated and observed data sets, 
similarity usually measured by a distance between summary stati- 
stics computed on both data sets. Among examples of inference in 
complex scenarios given above, all but one (the simplest) have used 
this approach, showing that it can indeed solve complex problems. 

The ABC approach presents two additional features that can be of 
interest for experimental biologist. One characteristic, already noted 
by Excoffier et al. (2005), is the possibility to assess the bias and 
precision of estimates for simulated data sets produced with known 
values of parameters with little extra computational cost. To get the 
same information with likelihood-based methods would require a 
huge amount of additional computation whereas, with ABC, the lar- 
gest proportion of computation used for estimating parameters can 
be recycled in a bias/precision analysis. The second feature is the 
simple way by which the posterior probability of different scena- 
rios applied to the same data set can be estimated (e.g. Miller et al., 
2005; Pascual et a/., 2007). 

In its current state, the ABC approach remains inaccessible to 
most biologists because there is not yet a simple software solution. 
Therefore, we developed the program DIYABC that performs 
ABC analyses on complex scenarios, i.e. which include any num- 
ber of populations and samples (samples possibly taken at different 
times), with populations related by divergence and/or admixture 
events and possibly experiencing changes of population size. The 
current version is restricted to unlinked microsatellite data. In this 
article, we describe the rationale for some methods involved in the 
program. Then we give the main features of DIYABC and we pro- 
vide two complete example analyses performed with this program 
to illustrate its possibilities. 



2 KEY METHODS INVOLVED IN DIYABC 

Inference about the posterior distribution of parameters in an ABC 
analysis is usually performed in three steps (see Figure SI in Sup- 
plementary material). The first one is a simulation step in which a 
very large table (the reference table) is produced and recorded. Each 
row corresponds to a simulated data set and contains the parameter 



values used to simulate the data set and summary statistics computed 
on the simulated data set. Parameter values are drawn from prior dis- 
tributions. Using these parameter values, genetic data are simulated 
as explained in the next section. The summary statistics correspond 
to those traditionally used by population geneticists to characterize 
the genetic diversity within and among samples (e.g. number of alle- 
les, genie diversity and genetic distances). The idea is to extract 
maximum genetic information from the data, admitting that exhau- 
stivity or sufficiency are generally out of reach. The simulation 
step is generally the most time-consuming step since the number 
of simulated data sets can reach several millions. The second step 
is a rejection step. Euclidian distances between each simulated and 
the observed data set in the space of summary statistics are com- 
puted and only the simulated data sets closest to the observed data 
set are retained. The parameter values used to simulate these selec- 
ted data sets provide a sample of parameter values approximately 
distributed according to their own posterior distribution. Beaumont 
et al. (2002) have shown that a local linear regression (third step 
= estimation step) provides a better approximation of the posterior 
distribution. 

This synoptic of ABC is well established and we now concentrate 
on more specific issues that are implemented in DIYABC. 



2.1 Simulating genetic data in complex scenarios 

Thanks to coalescence theory, it has become easy to simulate data 
sets by a two steps procedure. The first step consists of building 
a genealogical tree of sampled genes according to rather simple 
rules provided by coalescence theory (see below). The second step 
consists of attributing allelic states to all the nodes of the genea- 
logy, starting from the common ancestor and simulating mutations 
according to the mutation model of the genetic markers. In a com- 
plex scenario, only the first step needs special attention and we will 
concentrate on it now. 

In a single isolated population of constant effective size, the 
genealogical tree of a sample of genes is simulated backward in 
time: starting from the time of sampling, the gene lineages are 
merged (coalesced) at times that are drawn from an exponential dis- 
tribution with rate j(j — 1) /4iV e , when there are j distinct lineages 
and the (diploid) effective population size is N e . The genealogical 
tree is completed when there remains a single lineage. 

Consider now two isolated populations (effective population sizes 
Ni and N2 respectively) that diverged t d generations before their 
common sampling time. Since the two populations do not exchange 
genes, lineages within each population will coalesce independently. 
Coalescence simulation will stop either when there remains a single 
lineage or when the simulated time is beyond the divergence (loo- 
king back in time). In the latter case, the coalescence event is simply 
discarded. At generation td, the remaining lineages are simply poo- 
led and will coalesce in the ancestral population. Because of the 
memoryless property of the exponential distribution, the time to the 
first coalescence in the ancestral population is independent of the 
times of the last coalescence in each daughter population and can 
be simulated as in the single isolated population above. Again, the 
genealogical tree is completed when there remains a single lineage 
in the ancestral population. Note that the two populations need not 
be sampled at the same generation since this has no bearing on the 
simulation process. 
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Consider now the classic admixture scenario with one admixed 
and two parental populations, as in Figure 1 in Excoffier et al. 
(2005). Simulating the complete genealogical tree can be achieved 
with the following steps: i) coalesce gene lineages in each popu- 
lation independently until reaching admixture time, ii) distribute 
remaining lineages of the admixed population among the two paren- 
tal populations, each with a Bernoulli draw with probability equal 
to the admixture rate, iii) coalesce gene lineages in the two parental 
populations until reaching their divergence time, iv) pool the remai- 
ning gene lineages of the two parental populations and place them 
into the ancestral population and v) coalesce gene lineages in the 
ancestral population. 

We first note the modular form of this algorithm which involves 
only three modules: 

1. a module that performs coalescences in an isolated constant 
size population between two given times, 

2. a module that pools gene lineages from two populations (for 
divergence), 

3. a module that splits gene lineages from the admixed population 
between two parental populations (for admixture). 

We also note that the last two modules are quite simple and that 
the first one might be extended to include population size variations. 

We have introduced a fourth module that proves useful in many 
instances. It performs the (simple) task of adding a gene sample to 
a population at a given generation. The interest of this module is to 
allow for multiple samples of the same population taken at different 
generations. By combining the aforementioned four modules, it is 
possible to simulate genetic data involving any number of populati- 
ons according to a scenario that can include divergence, admixture 
events as well as population size variations. In addition, populati- 
ons can be sampled more than once at different times. Compared 
to our previous definition of complex scenarios, the only restric- 
tion so far concerns the absence of migrations among populations. 
If migrations have to be taken into account, coalescences in two (or 
more) populations exchanging migrants are no longer independent 
and should be treated in the same module. Such a module would 
require to consider simultaneously two kind of events, coalescences 
of lineages within population and migrations of gene lineages from 
one population to another. In the current stage of DIYABC, this 
has not yet been achieved. 

2.2 Two ways of simulating colescence events 

Simulating coalescences can be performed in two ways. The most 
traditional way is based on the usually fulfilled assumption that the 
effective population size is large enough so that the probability of 
coalescence is small and hence that the probability that two or more 
coalescences occur at the same generation is low enough so that it 
can be neglected (e.g. Nordborg, 2007). Time is then considered 
as a continuous variable in computations. The corresponding algo- 
rithm, called here the continuous time (CT) algorithm, consists in 
drawing first times between two successive coalescence events and 
then drawing 2 lineages at random at each coalescence event. 

However, in practice, population size can be so small (e.g. during 
a bottleneck) that multiple coalescences at the same generation 
become common, including with the same parental gene (producing 
multifurcating trees). Simulating gene genealogies with multiple 



coalescences is possible, (e.g. Laval and Excoffier, 2004). In effect, 
lineages are reconstructed one generation at a time: lineages exi- 
sting at generation g are given a random number drawn in U\\, 2N e ] 
and lineages with the same number coalesce together. The latter is 
termed here the generation by generation (GbG) algorithm. 

The CT algorithm is much faster in most cases and is used in most 
softwares, but in some circumstances, the approximation becomes 
unacceptable. The solution taken in DIYABC is to swap between 
the two algorithms according to a criterion based on the effective 
population size, the time during which the effective size keeps its 
value, and the number of lineages at the start of the module. The 
criterion is such that the generation per generation (GbG) algorithm 
is taken whenever it is faster (this occurs when the effective size is 
very small) or when the continuous time (CT) algorithm overesti- 
mates by more than 5% on average the number of lineages at the 
end of the module. 

A specific comparison study has been performed to establish 
this criterion. For different time periods counted in number of 
generations (g), effective population sizes (N e ) and number of ente- 
ring lineages (ri ei ), coalescences have been simulated according to 
each algorithm 10,000 times and the average number of remaining 
lineages at the end of the period have been recorded as well as 
the average computation duration of each algorithm. Our results 
(Figure S2) show that the following rules optimize computation 
speed while keeping the relative bias in coalescence rates under the 
5% threshold: 

if (1 < g < 30)doCTifn ei /iV e < 0.0031# 2 - 0.053# + 0.7197 
else do GbG 

if (30 < g < 100)doCTif n el /N e < 0.033^ + 1.7 else do GbG 
if (100 < g) do CT if n e i/N e < 5 else do GbG 

2.3 Comparing scenarios 

Using ABC to compare different scenarios and infer their poste- 
rior probability has been performed in two ways in the literature. 
Starting with a reference table containing parameters and summary 
statistics obtained with the different scenarios to be compared (or 
pooling reference tables, each obtained with a given scenario), data 
sets are ordered by increasing distance to the observed data set. A 
first method (termed hereafter the direct approach) is to take as an 
estimate of the posterior probability of a scenario the proportion of 
data sets obtained with this scenario in the ns closest data sets (Mil- 
ler et al, 2005; Pascual et al., 2007). The value of ns is arbitrary 
and unless the results are quite clear cut, the estimated posterior 
probability may vary with ns . 

Following the same rationale that introduced the local linear 
regression in the estimation of posterior distributions for parame- 
ters (Beaumont et al. , 2002), we perform a weighted polychotomous 
logistic regression to estimate the posterior probability of scenarios, 
termed hereafter the logistic approach (see also Fagundes et al, 
2007; Beaumont, 2008). In the estimation of parameters, a linear 
regression is performed with dependent variable the parameter and 
predictors the differences between the observed and simulated sta- 
tistics. This linear regression is local at the point (in the predictor 
space) corresponding to the observed data set, using an Epanechni- 
kov kernel based on the distance between observed and simulated 
summary statistics (see formula (5) in Beaumont et al, 2002). Para- 
meters values are then replaced by their estimates at that point in the 
regression. 
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Keeping the differences between observed and simulated stati- 
stics as the predictor variables in the regression, we consider now the 
posterior probability of scenarios as the dependent variable. Because 
of the nature of the "parameter", an indicator of the scenario num- 
ber, a logit link function is applied to the regression. The local 
aspect of the regression is obtained by taking the same weights as in 
the linear adjustment of parameter values as described in Beaumont 
et al. (2002). Confidence intervals for the posterior probabilities 
of scenarios are computed through the limiting distribution of the 
maximum likelihood estimators. See Annex 1 in Supplementary 
material for a detailed explanation. 

2.4 Quantifying confidence in parameter estimations 
on simulated test data sets 

In order to measure bias and precision, we need to simulate data sets 
(i.e. test data sets) with known values of parameters and compare 
estimates with their true values. In the ABC estimation procedure, 
the most time-consuming task is to produce a large enough reference 
table. However, when such a reference table has been produced, e.g. 
for the analysis of a real data set, it can also be used to quantify bias 
and precision on test data sets as well. 

Measuring bias is straightforward, but precision can be assessed 
with different measures. In DIYABC, the latter include the rela- 
tive square root of the mean square error, the relative square root of 
the mean integrated square error, the relative mean absolute devia- 
tion, the 95 and 50% coverages and the factor 2. See Annex 2 in 
Supplementary material for more details. 



3 DIYABC: A COMPUTER PROGRAM FOR 
POPULATION BIOLOGISTS 

3.f Main features 

DIYABC is a program that performs ABC inference on popu- 
lation genetic data. In its current state, the data are genotypes at 
microsatellite loci of samples of diploid individuals (missing data 
are allowed). The inference bears on the evolutionary history of the 
sampled populations by quantifying the relative support of data to 
possible scenarios and by estimating posterior densities of associa- 
ted parameters. DIYABC is a program written in Delphi running 
under a 32-bit Windows operating system (e.g. Windows XP) and it 
has a user-friendly graphical interface. 

The program accepts complex evolutionary scenarios involving 
any number of populations and samples. Scenarios can include any 
number of the following timed events: stepwise change of effec- 
tive population size, population divergence and admixture. They 
can also include unsampled as well as serially sampled populati- 
ons as in Beaumont (2003). The main restriction regarding scenario 
complexity is the absence of migrations between populations. 

Since the program has been written for microsatellite data, it 
proposes two mutation models, namely the stepwise mutation 
model (SMM) and the generalized stepwise mutation (GSM) model 
(Estoup et al, 2002). Note that the same mutation model has to be 
applied to all microsatellite loci, but these may have different values 
of mutation parameters. 

The historico-demographic parameters of scenarios may be of 
three types: effective sizes, times of events (in generations) and 



admixture rates. Marker parameters are mutation rates and the coef- 
ficient of the geometric distribution (under the GSM only). The 
program can also estimate composite parameters such as 6 = 4N e /j, 
and r = tfi, with N e being the diploid effective population size, 
t the time of an event and fj, the mean mutation rate. Prior distri- 
butions are defined for original parameters and those for composite 
parameters are obtained via an assumption of independence of their 
component prior densities. Priors for historico-demographic para- 
meters can be chosen among four common distributions: Uniform, 
Log-uniform, Normal and Log-normal. Users can set minimum and 
maximum (for all distributions) and mean and standard deviation 
(for Normal and Log-normal). In addition, priors can be modified 
by setting binary conditions (>, <, > and <) on pairs of parame- 
ters of the same category (two effectives sizes or two times of event). 
This is especially useful to control the relative times of events when 
these are parameters of the scenario. For priors of mutation parame- 
ters, only the Uniform and the Gamma distributions are considered, 
but hierarchical schemes are possible, with a mean mutation rate 
or coefficient P (of the geometric distribution in the GSM) drawn 
from a given prior and individual loci parameter values drawn from 
a gamma distribution around the mean. 

Available summary statistics are usual population genetic stati- 
stics averaged over loci: e.g. mean number of alleles, mean genie 
diversity, Fst, (<5/i) 2 , admixture rates ... 

Regarding ABC computations, the program can i) create a refe- 
rence table or append values to an existing table, ii) compute the 
posterior probability of different scenarios, iii) estimate the poste- 
rior distributions of original and/or composite parameters for one or 
more scenarios and iv) compute bias and precision for a given sce- 
nario and given values of parameters . Finally, the program can be 
used simply to simulate data sets in the popular Genepop format 
(Raymond and Rousset, 1995). 



3.2 Two examples of analysis with DIYABC 

3.2.1 Illustration on a simulated data set In order to illustrate 
the capabilities of DIYABC, we take first an example based on a 
data set simulated according to a complex scenario including three 
splits and two admixture events (scenario 1 in Figure 1). The sce- 
nario includes six populations: two of them have been sampled at 
time 0, the third one at time 2 and the fourth one at time 4, the 
last two have not been sampled. Each population sample includes 
30 diploid individuals and data are simulated at 10 microsatellite 
loci. This scenario is not purely theoretical as it could be applied for 
instance to European populations of honeybees in which the Italian 
populations (Apis mellifera ligustica result from an ancient admix- 
ture between two evolutionary branches (Franck et al, 2000) that 
would correspond here to population samples 1 and 4. Furthermore, 
in the last 50 years, Italian bees have been widely exported and sam- 
ple 2 could well correspond to a population of a parental branch 
that has been recently introgressed by Italian queens. This exam- 
ple also stresses the ability of DIYABC to distinguish two events 
that are confounded in the usual admixture scheme: the admixture 
event itself and the time at which the real parental populations in the 
admixture diverged from the population taken as "parental". 

Our ABC analysis will address the following questions: 1) Sup- 
pose that we are not sure that the scenario having produced our 
example data set does include a double admixture and that we want 
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to challenge this double admixture scenario with two simpler sce- 
narios, one with a single admixture (scenario 2 in Figure 1) and the 
other with no admixture at all (scenario 3). What is the posterior pro- 
bability of these three scenarios, given identical prior probabilities ? 
2) What are the posterior distributions of parameters, given that the 
right scenario is known ? and 3) What confidence can we have in 
the posterior probabilities of scenarios and posterior distributions of 
parameters ? 

First, a reference table is built up. Using different screens of 
DIYABC, i) the three scenarios are coded and prior distributi- 
ons of parameters are defined (Figure S3), ii) based on previous 
studies (e.g. Dib et ah, 1994 in Pascual et al, 2007), the Genera- 
lized Stepwise Mutation model is selected and prior distributions 
of mutation parameters are defined (Figure S4), iii) motif sizes and 
allele ranges of loci are set (Figure S5) and iv) summary statistics 
are selected (Figure S6). After some hours, a reference table with 6 
million simulated data sets (i.e. 2 million per scenario) is produced. 



Scenario 1 Scenario 2 Scenario 3 




Fig. 1. First example: the three evolutionary scenarios. The data set used as 
an example has been simulated according to scenario 1 (left). The parameter 
values were the following: all populations had an effective (diploid) size 
of 1,000, the times of successive events (backward in time) were ti=10, 
t 2 =500, t 3 =10,000, t 4 =20,000 and t 5 =200.000, the two admixture rates 
were ri=0.6 and r2=0.4. Scenario 1 includes 6 populations, the four that 
have been sampled and two left parental populations in the admixture events. 
Scenario 2 and 3 include 5 and 4 populations respectively. Samples 3 and 4 
have been collected 2 and 4 generations earlier than the first two samples, 
hence their slightly upward locations on the graphs. Time is not at scale. 

To answer the first question, the rz^=60,000 (1%) simulated data 
sets closest to the pseudo-observed data set are selected for the 
logistic regression and n^=600 (0.01%) for the direct approach. 
The answer appears in two graphs (upper row in Figure S7). Both 
approaches are congruent and show that scenario 1 is significantly 
better supported by data than any other scenarios. 

To answer the second question, scenario 1 is chosen and poste- 
rior distributions of parameters are estimated taking the 20,000 (1%) 
closest simulated data sets, after applying a logit transformation of 
parameter values. Here again, the output is mostly graphical. Each 
graph provides the prior and posterior distributions of the corre- 
sponding parameter (Figure S8). Below each graph are given the 
mean, median and mode as well as four quantiles (0.025, 0.05, 0.95 
and 0.975) of the posterior distribution (Table SI in Supplementary 
material). Since the true values are known, we can remark that some 
parameters are rather well estimated with peaked posteriors such as 
the common effective population size and the two admixture rates 
whilst other including all time parameters suggest that data are not 
very informative for them. Very similar results (not shown) have 



been obtained with 5,000 and 40,000 simulated data sets selected for 
the local linear regression, as well as when using a smaller reference 
table (1 million data sets). 

To evaluate the confidence that can be put into the posterior pro- 
bability of scenarios, 500 test data sets were simulated with each 
scenario and known parameter values (i.e. the same values as those 
used to produce the original data set). Posterior probabilities of the 
three scenarios were estimated as above and used to compute type I 
and II errors in the choice of scenario. Results show that scenario 3 
is always rightly chosen or excluded. Consequently type I error for 
scenario 1 is identical to type II error for scenario 2 and vice-versa. 
For scenario 1, type I errors amount to 0.414 and 0.3 for the direct 
approach and the logistic regression respectively whereas type II 
errors amount to 0.014 and 0.020 (cf Figures S9, S10 and Sll for 
detailed distributions of scenario probabilities). The 500 test data 
sets simulated with scenario 1 have also been used to estimate poste- 
rior distributions of parameters, taking the same proportion (1%) of 
closest simulated data sets as above. Relative biases and dispersion 
measures are given in Table S2 (upper part). It is clear that several 
parameters are biased and/or dispersed, the worst case being that of 
parameter ti. The bias is undoubtedly related to the lack of infor- 
mation in the data, so that point estimates are drawn towards the 
mean values of prior distributions. The effect of prior distribution 
is also illustrated in the lower part of Table S2 which provides the 
same measures, but obtained with different prior distributions for 
effective size and time of event parameters. 

3.2.2 Illustration on a real data set Our second example con- 
cerns populations of the Silvereye, Zosterops lateralis lateralis 
(Estoup and Clegg, 2003). During the 19th and 20th century, 
this bird colonised Southwest Pacific islands from Tasmania. The 
importance of serial founder events in the microevolution of this 
species has been questioned in a study based on a six microsatellite 
loci data set (Clegg et al, 2002). 

Our analysis with DIYABC differs by at least four aspects 
from the initial ABC analysis processed from the same data set by 
Estoup and Clegg (2003). First all island populations are treated 
here in the same analysis whereas, for tractability reasons, inde- 
pendent analyses were made using all pairs of populations. Second, 
the initial treatment was based on the algorithm of Pritchard et al. 
(1999), whereas DIYABC uses the local linear regression method 
of Beaumont et al. (2002), which allows a larger number of statistics 
(see below) and hence makes a better use of data. Third, we have 
chosen here non-informative flat priors for all demographic parame- 
ters. Fourth, because DIYABC is able to treat samples collected at 
different times, we did not have to pool samples collected at diffe- 
rent years from the same island and average sample year collection 
over islands. We hence end up with a colonization scenario invol- 
ving five populations and seven samples, two samples having been 
collected at different times in two different islands (Figure 2). The 
sequence and dates of colonisation by silvereyes to New Zealand 
(South and North Island) and Chatham and Norfolk islands have 
been historically documented. This allows the times for the putative 
population size fluctuation events in the coalescent gene trees to be 
fixed, thus limiting the number of parameters. Our scenario was spe- 
cified by six unknown demographic parameters: the stable effective 
population size (Ns ) and the duration of the initial bottleneck (Db ), 
both assumed to be the same in all islands and potentially different 
effective number of founders in Norfolk, Chatham and South and 
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(Warning ! Time is not to scale.) 
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Fig. 2. Second example: screenshot of the scenario used in the analysis of 
the Zosterops lateralis lateralis data set. In 1 830, Z. /. lateralis colonised the 
South Island of New Zealand (Pop2) from Tasmania (Popl). In the following 
years, the population began expanding and dispersing, and reached the North 
Island by 1856 (Pop3). Chatham Island (Pop4) was colonised in 1856 from 
the South Island, and Norfolk Island (Pop5) was colonised in 1904 from the 
North Island (historical information reviewed in Estoup and Clegg 2003). 
Sample collection times are 1997 for Tasmania (Sal), South and North island 
of New Zealand (Sa2 and Sa3, respectively), Chatham island (Sa4) and Nor- 
folk island (Sa5), 1994 for the second sample from Norfolk (Sa6), and 1992 
for the second sample from the North island of New Zealand (Sa7). Splitting 
events and sampling dates in years were translated in number of generations 
since the most recent sampling date by assuming a generation time of three 
years (Estoup and Clegg 2003). We hence fixed tl, t2, t3 and t4 to 31, 47, 47 
and 56 generations, respectively. 



Parameter 


mean 


median 


mode 


<2u.050 


Qo.950 


st. deviation 


N s 


9,399 


7,446 


4,107 


2,706 


23,007 


6,273 


N F1 


19 


18 


16 


9 


33 


8.7 


N F2 


202 


173 


108 


55 


435 


118 




197 


168 


112 


55 


430 


116 


N Fi 


293 


288 


278 


129 


470 


105 



Table 1. Second example : mean, median, mode, quantiles and standard 
deviation of posterior distribution samples for effective population sizes 
(original parameters) for the Zosterops lateralis lateralis data set. 
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North island of New Zealand (Nfi, Nf2, Nf3 and Nf4, respec- 
tively). As in Estoup and Clegg (2003), we also assumed that all 
populations evolved as totally isolated demes after the date of colo- 
nisation. 

We chose uniform priors (7 [300, 30000] for N s , U[2, 500] for all 
NFi and C7[l, 5] for Db- Prior information regarding the mutation 
rate and model for microsatellites was the same as in the previous 
example. Summary statistics included the mean number of alle- 
les, the mean genie diversity (Nei, 1987), the mean coefficient M 
(Garza and Williamson, 2001), Fst between pairs of population 
samples (Weir and Cockerham, 1984), and the mean classification 
index, also called mean individual assignment likelihood (Pascual 
et al, 2007). We produced a reference table with 1 million simula- 
ted data sets and estimated parameter posterior distributions taking 
the 10,000 (1%) simulated data sets closest to the observed data set 
for the local linear regression, after applying a logit transforma- 
tion to parameter values. Similar results were obtained when taking 
the 2,000 to 20,000 closest simulated data sets and when using a 
log or log — tangent transformation of parameters as proposed in 
Estoup et al. (2004) and Hamilton et al. (2005) (options available in 
DIYABC). 

Results for the main demographic parameters are presented in 
Table 1. They indicate the colonization by a small number of 



Fig. 3. Second example: posterior distributions of the bottleneck severity 
(see definition in text) for the invasions of four Pacific islands by Zoste- 
rops lateralis lateralis. The four discontinuous lines with small dashes, dots, 
dash-dots and long dashes correspond to Norfolk, Chatham, North Island 
and South Island of New Zealand, respectively. The continuous line corre- 
sponds to the prior distribution, which is identical for each island. This graph 
has been made with the locfit function of the R statistical package (Ihaka and 
Gentleman, 1996), using an option of DIY ABC which saves the sample 
of the parameter values adjusted by the local linear regression (Beaumont et 
al, 2002). 



founders and/or a slow demographic recovery after foundation 
for Norfolk island only (median Nfi value of 18 individuals). 
Other island populations appear to have been founded by silvereye 
flocks of larger size and/or have recovered quickly after founda- 
tion. In agreement with this, the bottleneck severity (computed as 
BSi=DB x Ns /NFi) was more than one order of magnitude larger 
for the population from Norfolk than for other island populations 
(Figure 3). These results are in the same vein as those obtained by 
Estoup and Clegg (2003) and agree with their main conclusions. 
Discrepancies in parameter estimation are observed however (e.g. 
larger Ns values and more precise inferences for Nf2, Nf3 and 
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Nf4 in the present treatment). This was expected due to the dif- 
ferences in the methodological design underlined above. With the 
possibility of treating all population samples together, DIYABC 
allows a more elaborate and satisfactory treatment compared to 
previous analyses (Estoup and Clegg, 2003; Miller et at, 2005). 



4 CONCLUSION 

So far, the ABC approach has remained inaccessible to most 
biologists because of the complex computations involved. With 
DIYABC, non specialists can now perform ABC-based inference 
on various and complex population evolutionary scenarios, without 
reducing them to simple standard situations, and hence making a 
better use of their data. In addition, this programs also allows them 
to compare competing scenarios and quantify their relative support 
by the data. Eventually, it provides a way to evaluate the amount of 
confidence that can be put into the various estimations. The main 
limitations of the current version of DIYABC are the assumed 
absence of migration among populations after they have diverged 
and the mutation models which mostly refer to microsatellite loci. 
Next developments will aim at progressively removing these limita- 
tions. 
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